ABSTRACT

Procedures for compiling a word list using computers are
described. Words were taken from 127 books in fourteen series of
widely used elementary textbooks. The compilation procedures
consisted of (1) input, putting the lists into the computer;
(2) processing of the vocabulary into compiled lists; and
(3) output, production of the actual word lists. Rules set up to
determine whether inflected forms of words would be included are
described. Capitalized proper nouns, abbreviations, word parts, and
hyphenated words were deleted. Scanning programs were used to correct
and proofread initial lists. The processing of the words resulted in
four kinds of lists: (1) the Core List (words which were included in
three or more of the six reader series), (2) the Additional List
(words found in four or more different series, excluding Core words),
(3) four Technical Lists, and (4) a Total Alphabetical List in which
all the lists were merged and put in alphabetical order. A comparison
between this list and four other word lists is made. Sample
printouts, tables of data, and references are included. (AL)
Developing and Comparing Elementary School
Word Lists by Computer



I. Compilation Procedures

The Harris-Jacobson word list (1972) is based on a computerized
analysis of the total vocabulary content of 127 books in fourteen
recently published and widely used series of elementary school
textbooks. Since the fourteen series include six in reading, and
two each in English, mathematics, science, and social studies,
the vocabulary constitutes a rich variety of wordstock providing
large numbers of general and technical vocabulary words which do
not occur in most existing word lists. In addition, the inclusion
of all of the books of six newer reading series which reflect the
trend toward less exacting control over basal reader vocabulary
increased the likelihood of obtaining words not in existing word
lists. Thus the lists derived from these 14 series should have
many words in common with other word lists but should also have
many new and different words which the less comprehensive or
older lists do not have.

The words determined to be the basic essential vocabulary
for elementary reading were organized into a General List, a
Technical List, and a Total List through a series of computer
processes. These procedures may be defined conceptually as
(1) input, getting the lists into the computer; (2) processing of
the vocabulary into compiled lists; and (3) output, or production
of the actual word lists.

Before work on compiling the lists could proceed, two sets of
rules had to be established. One set governed the situations in
which inflected forms were or were not to be merged with their
root words; the other set established which words were deleted.

At the preprimer level roots were combined with plural inflections
(root word plus -s). Words at the primer level included root
plus -s, -es, -'s, -d, -ed, and -er (comparative). At the first
reader level, the rule was the same as that for the primer level
with the addition that -ing and -est endings were listed with
root words. At the second grade level all first grade variants
were listed plus variants with the endings -ed, -ing, -er, and
-est which follow a doubled consonant, variants which change
y to i before adding -ed, -er, -es, or -est, and variants ending
in -ly and -ily. Variants at levels three and up were the
same as those included at grade two. Variants occurring at a
level lower than the level at which such variants were procedurally
included were included according to the frequency criteria of
root words. Variants dropping -e before adding -y (bone, bony;
rose, rosy) were treated as unique words. Variants ending in -er
were classified as comparatives, agents, or root words by
personal judgment.
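
The level-by-level rules for plain endings are cumulative, so they can be pictured as a small lookup table. The following is a minimal Python sketch under stated assumptions: the names LEVELS, NEW_ENDINGS, allowed_endings, and merge_with_root are hypothetical, only the preprimer, primer, and first reader rules are encoded, and the spelling-change rules introduced at grade two and the judgment calls on -er are left out. It is not the program the authors used.

```python
# Illustrative encoding of the plain-ending rules; all names are hypothetical.
LEVELS = ["preprimer", "primer", "first"]

# Endings first admitted at each level (rules are cumulative across levels).
NEW_ENDINGS = {
    "preprimer": ("s",),
    "primer": ("es", "'s", "d", "ed", "er"),
    "first": ("ing", "est"),
}


def allowed_endings(level):
    """Return every ending that is merged with its root at the given level."""
    endings = []
    for name in LEVELS[: LEVELS.index(level) + 1]:
        endings.extend(NEW_ENDINGS[name])
    return endings


def merge_with_root(word, roots, level):
    """Return the root if `word` is a known root plus an allowed ending,
    otherwise return the word unchanged (it stays a unique entry)."""
    # Longer endings are tried first so that "es" is tested before "s".
    for ending in sorted(allowed_endings(level), key=len, reverse=True):
        if word.endswith(ending):
            root = word[: -len(ending)]
            if root in roots:
                return root
    return word
```

For example, merge_with_root("jumping", {"jump"}, "first") gives "jump", while the same call with level "primer" leaves "jumping" as a separate entry because -ing is not yet an admitted ending.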

The other set of rules established which classes of words
were deleted. Capitalized proper nouns were deleted, as were
abbreviations and word parts which appear in textbook reader and
English lessons. Hyphenated words were deleted except where
their meaning cannot be easily inferred from the meaning of the
joined root words (good-by, tom-tom).
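
The deletion rules can be sketched as a simple filter. This is an illustration only: the function name should_delete and the detection heuristics (an initial capital for proper nouns, a trailing period for abbreviations, a leading or trailing hyphen for word parts) are assumptions, and the set of retained hyphenated words holds just the two examples given in the text.

```python
# Hyphenated words kept because their meaning is not easily inferred from
# the joined roots; only the two examples from the text are listed here.
KEEP_HYPHENATED = frozenset({"good-by", "tom-tom"})


def should_delete(token, keep_hyphenated=KEEP_HYPHENATED):
    """Return True if the token falls in one of the deleted classes."""
    if token[:1].isupper():
        return True  # capitalized proper noun (assumed heuristic)
    if token.endswith("."):
        return True  # abbreviation written with a period (assumed heuristic)
    if "-" in token:
        if token.startswith("-") or token.endswith("-"):
            return True  # word part such as a prefix or suffix
        return token not in keep_hyphenated
    return False
```

A real run would also have to distinguish proper nouns from ordinary words capitalized at the start of a sentence, which this sketch does not attempt.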

The first step in compiling the lists was input, or getting
the words from the books into the computer. When the publisher
provided a list of the words new to the series, the list was
typed in sequence on IBM cards. This was true for all of the
primary-grade readers and half of the intermediate-grade readers.
When such lists were not available (the other half of the
intermediate readers, and all of the content textbooks), every
word in the book was typed in sequence either on IBM cards or
on photosensitive, machine-readable paper in machine-readable
type. From the cards or paper the data were fed into a computer
and registered on memory tapes. A comparative study showed the
IBM card procedure to be the less costly, because the
photosensitive paper required several intermediate machine
operations which were expensive.



The word list for each book was alphabetized by the computer.
The resulting printout was then corrected by a series of four
procedures which ensured that erroneous entries were reduced to
a bare minimum. Initial text corrections were made
by a single oral proofreading, found to be much faster than
machine verification on a keypunch verifier and capable of
discovering two-thirds of the errors in the first reading. Since
this oral proofreading process required 27 hours of clerical time
per 100,000-word book, and there were 127 books, repetitions of
such proofreadings were considered inefficient.
The second correction procedure utilized a new computer
program which greatly reduced the manual labor required. This
program is based on the existing Key Word In Context (KWIC)
programs. As it is a specialized, abbreviated adaptation, it was
entitled "Quickie."



The Quickie program scans input text and produces a
reedited and sequenced file consisting of IBM card images (these
images are two-thirds the length of a line of 120 spaces of
ordinary computer printouts). This file is printed by the
computer. Every line on the computer printout is numbered in
sequence and consists of the exact textual data as punched on
one IBM card.

Once the card image printouts have been printed, the Quickie
program uses this file to reduce to a fraction the material to
be proofread.

The body of unique words subject to proofreading and
correction can be further reduced by comparing, by computer, the
text to a core-memory dictionary of common words stored in the
computer. Approximately 60% of the running words in textual
material are among Thorndike's 1000 most common words. If these
words are expanded with their variants to make a 3,000-word
dictionary, a single scanning operation by the computer will
reveal that only 5% of the 100,000 running words in the
fifth-grade text are not in the dictionary and thus require
visual verification. Of these 5,000 words approximately 250 were
identified as possibly incorrect and were referred to in context.
Almost all of the 250 words required correcting.
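
The dictionary scan is, in effect, a set difference between the words of the text and a stored list of common words. A minimal Python sketch follows, assuming hypothetical function names and assuming that case is folded before lookup (the source does not say how the original program treated case).

```python
def words_to_verify(running_words, dictionary):
    """Return the unique words absent from the common-word dictionary,
    in alphabetical order; these are the ones needing visual checking."""
    return sorted({w.lower() for w in running_words} - set(dictionary))


def fraction_unmatched(running_words, dictionary):
    """Return the share of running words that are not in the dictionary."""
    known = set(dictionary)
    unmatched = sum(1 for w in running_words if w.lower() not in known)
    return unmatched / len(running_words)
```

With the figures quoted above, 5% of 100,000 running words leaves 5,000 words to inspect, and the 250 suspect words are in turn 5% of those 5,000.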
The third correction operation was a visual scanning of
corrected texts, after which the word lists were generated.
Finally, the lists were scanned by the authors and odd-looking
words were verified or corrected.

Though the input text was punched on IBM cards, the
processing system is able to accept data on paper tapes, magnetic
tapes, or photosensitive paper, enabling researchers to use
packaged instruction programs, or other texts such as AP-UPI
tapes available on such input media, in studies which implement
the processing procedures used in compiling this word list.

After correction of all of the input data, the second or
processing stage was conducted. The computer merged all the
words from all the basal readers, from preprimer through grade
six, into one alphabetical sequence. This was done by a
scan-and-sort computer operation which alphabetizes the words and
indexes their frequencies and levels of appearance into one list
of unique words. Each word was accompanied by information which
showed each book in which it appeared, making it easy to note
the lowest book in which it was first used in each series.
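
The scan-and-sort operation can be pictured as building an index from each unique word to the series and levels in which it occurs, with a frequency count for each. The sketch below is a hedged illustration: the function names, the tuple layout of the input, and the level labels in LEVEL_ORDER are all assumptions made for the example, not details taken from the original program.

```python
# Hypothetical level labels, ordered from lowest to highest.
LEVEL_ORDER = ["preprimer", "primer", "1", "2", "3", "4", "5", "6"]


def build_master_file(occurrences):
    """Index running words as {word: {series: {level: frequency}}}.

    `occurrences` is an iterable of (word, series, level) tuples,
    one tuple per running word.
    """
    master = {}
    for word, series, level in occurrences:
        by_level = master.setdefault(word, {}).setdefault(series, {})
        by_level[level] = by_level.get(level, 0) + 1
    return master


def lowest_level(master, word, series):
    """Return the lowest level at which `word` occurs in `series`, or None."""
    levels = master.get(word, {}).get(series, {})
    return min(levels, key=LEVEL_ORDER.index) if levels else None
```

Iterating over sorted(master) then yields the unique words in one alphabetical sequence, as the printed master file did.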

These listings were then printed to obtain a master file
of all unique words found in the reading series. This file gave
unique words and listings for over 2,000,000 running words.

Figure 1 illustrates these listings.

At this point the rules for merging variants with roots, 
and for deleting certain classes of words were applied. 

The criteria for inclusion in the Core List were then applied
and the words which qualified were marked. Words which appeared
Figure 1

An Example of the Information Contained in the Reading
Series Master File Printout

                     Grade Level Abbreviation
                    P   Q   1   2   3   4        5        6

a           RS1     xx  xx  xx  xx  xx  xx       R500001  xx
            RS2     xx  xx  xx  xx  xx  xx       R500001  xx

ad          RS1     xx  xx  xx  xx  xx  R400005  R500005  R600005
            RS5     xx  xx  xx  xx  xx  R400001  xx       xx

additional  RS1     xx  xx  xx  xx  xx  xx       R500008  R600002
            RS4     xx  xx  xx  xx  xx  xx       xx       R600001

(RS1 is reading series 1, R5 is 5th grade in a reader series, etc.)

in three or more of the six reader series were included in the
Core List. The Core List was copied out, verified, typed on
IBM cards, and entered into the computer.

The next step involved two operations: adding all of the
words from the content books to the basal reader list, and
deleting all Core words from that list. The resulting
alphabetical list provided the raw material for the Additional
List and the four Content lists. Variants were merged and
deletions made again.

The Additional List, consisting of words found in four or
more different series (excluding Core words), was then selected
by research assistants and reviewed by the authors. With the
Additional List available, the alphabetized word list for each
content area was gone over and those words which satisfied the
criteria for the particular content area were marked and verified.
The four Technical Lists were copied out and entered into the
computer.
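
The selection thresholds for the Core and Additional Lists are simple counts of the series in which a word appears. In the study the marking was done by hand from the printouts, so the following Python sketch only shows how the same two criteria could be expressed; the function names and the dictionary-of-sets input are assumptions made for the example.

```python
def core_list(reader_series_by_word, min_series=3):
    """Words found in at least `min_series` of the reader series.

    `reader_series_by_word` maps each word to the set of reader series
    in which it appears.
    """
    return sorted(
        w for w, series in reader_series_by_word.items()
        if len(series) >= min_series
    )


def additional_list(all_series_by_word, core, min_series=4):
    """Non-Core words found in at least `min_series` different series,
    counting reader and content series together."""
    core = set(core)
    return sorted(
        w for w, series in all_series_by_word.items()
        if w not in core and len(series) >= min_series
    )
```

Keeping the thresholds as parameters makes it easy to see how the size of each list would change under a stricter or looser criterion.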
At this point, all the data needed for the Total Alphabetical
List had been assembled. A series of computer operations merged
all of the separate lists into the Total Alphabetical List of
7,613 words, 16,849 when inflected forms are included. To do
this, each word appearing in at least one of the component lists
(Core, Additional, and Content) was listed.

After the processing of the lists was completed, the third
stage, the computer printout, was made. Figure 2 illustrates this
printout. The Total List presents information about the list
in which the word appeared, such as Core, Additional, or Content,
and identifies each series (reader or content) and level in
which the word appeared. Because of the rules for inclusion of
inflected forms, the Total Alphabetical List contains all unique
words, lists their inflected forms, and lists the stipulated
special inflected forms as unique words.

In addition to containing all of the unique words that are
in each of the other lists, the Total Alphabetical List provides
for each word all of the essential information used in assigning
the words to the respective lists.
II. Comparison Procedures



A computer program capable of comparing word list
content seems useful for a variety of reasons. Most obvious
is facilitation of comparison of word list content according to
criteria of range, scope, or form of words which should be
included. A more subtle application might be the comparison
of lists and the materials constructed with them in order to
identify differences created by the passage of time, or some
other factor.



Some of the lists in widespread use today were developed
as many as fifty years ago. A computerized comparison procedure
allows one to evaluate the differences between old lists and
modern ones according to criteria of obsolescence in vocabulary.
In effect, the process of aging can be isolated and identified,
making the evaluation of the usefulness of old lists and the
materials which they were used to develop a feasible task. As
new lists are developed, their content can be compared, allowing
users to evaluate the relative usefulness of one or another.



The procedure used to enable an automated comparison of word
list content involved punching several lists onto IBM
cards, then programming the computer to sort the words, compare
them for correspondence, check for correspondence or variation
in level assignment, and print out the results in verbal form.
This has been done in a comparison of the Harris-Jacobson Basic
Elementary Vocabularies (1) with the Dale list of 3,000 words (2),
the Botel list (3), and the Taylor list for grades 1-8 and
grades 9-13 (4). The words were punched sequentially, separated
by commas or spaces and followed by level information.

The computer processing can be broken into two stages. The
first stage receives and stores the raw data of the lists,
automatically alphabetizing the words. This stage of the program
forms a file constituting a single list of the words contained
in all the lists, in effect merging the lists to be compared.
Every word contained in the lists is recorded once in
alphabetical order. Each word is accompanied by a mask 96
columns long, allowing the recording of 96 pieces of information
for each word, such as the lists in which it appears. These
columns could be allotted so as to record level assignments or
other categorizations made by Harris-Jacobson and compilers of
the other lists. For instance, the Harris-Jacobson list is
composed of Core, Additional, and Content vocabularies, and the
Core and Additional vocabularies are stratified by grade level.
Thus, the columns of the mask could be allotted so as to indicate
the composite list and/or the grade level in which a word appears.

The next group of bits could be allotted to the next list,
broken down according to its assigned levels or categories and
so on. The file composed by this first stage of the program
incorporates facilities for generating new information, for
updating, or for correction of the existing data.
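
The 96-column mask behaves like a bit field attached to each word. A minimal Python sketch of the idea follows; the column allotment in COLUMNS is invented for illustration (the source does not give the actual layout), and the function names are hypothetical.

```python
MASK_WIDTH = 96  # one column (bit) per piece of information about a word

# Hypothetical allotment of columns to (list, category) pairs.
COLUMNS = {
    ("HJ", "Core"): 0,
    ("HJ", "Additional"): 1,
    ("HJ", "Content"): 2,
    ("Dale", "present"): 10,
    ("Botel", "present"): 20,
}


def set_flag(mask, key):
    """Return `mask` with the column allotted to `key` switched on."""
    column = COLUMNS[key]
    if not 0 <= column < MASK_WIDTH:
        raise ValueError("column outside the 96-column mask")
    return mask | (1 << column)


def has_flag(mask, key):
    """Report whether the column allotted to `key` is switched on."""
    return bool(mask & (1 << COLUMNS[key]))


def merge_lists(lists):
    """Merge {(list, category): words} into an alphabetized {word: mask}."""
    merged = {}
    for key, words in lists.items():
        for word in words:
            merged[word] = set_flag(merged.get(word, 0), key)
    return dict(sorted(merged.items()))
```

Each word is stored once, and adding another list or another level breakdown only requires allotting further columns, which is the updating facility the text describes.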

The second stage of the program reads through the file
compiled by the first stage, and prints and tallies the merged
lists. This printer stage of the program inputs a list of the
potential titles to be sought in the mask of the stage-one
file, checks the columns for the requisite information, and prints
the words with the appropriate titles. The result is a listing
with all the words contained in all the word lists appearing in
alphabetical order along the left margin. Next is a space in
which the presence or absence of the word in the master list can
be noted. To the right, the comparison lists in which the word
appears are shown. The printout thus records the unique words of
each list, the words which appear in more than one list and
where they are matched, and level information for each
word if such information is provided by the compilers of the
list. This printout can be easily read, and the nature of
the matched and unmatched words can be observed.



In addition to the printout of the merged and compared
lists, the program tallies information about the results, such
as the number of words in both of two lists, the number of words
in one list not in the other, the number of matched words which
have been assigned to the same level by both compilers, or
similarly, different levels. Categorical information supplied
by the compilers can be noted as criteria in the comparison.
Further, the program can print out a list of matched words without
unmatched words, or the unmatched words from any list without the
matches.
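
These tallies correspond to ordinary set operations on the two word lists. A hedged Python sketch, in which the function name compare and the representation of each list as a word-to-level mapping (with None where no level is given) are assumptions:

```python
def compare(master, other):
    """Tally the overlap between two lists given as {word: level or None}."""
    both = master.keys() & other.keys()
    # Level agreement is only meaningful where both compilers give a level.
    leveled = {w for w in both
               if master[w] is not None and other[w] is not None}
    same = {w for w in leveled if master[w] == other[w]}
    return {
        "in_both": len(both),
        "master_only": len(master.keys() - other.keys()),
        "other_only": len(other.keys() - master.keys()),
        "same_level": len(same),
        "different_level": len(leveled) - len(same),
    }
```

Printing sorted(both) or sorted(other.keys() - master.keys()) would give the matched-only or unmatched-only listings mentioned above.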



The data for the study consisted of four word lists. The
first was the Harris-Jacobson Basic Elementary Reading Vocabulary
recently developed by Albert Harris and myself (1). The H-J
computer list for this study includes both the Harris-Jacobson
[...]

This list was compared to three other word lists:
the Dale list of 3,000 common words developed by Edgar Dale (2),
the Botel-Bucks County list of 1,185 common words developed by
Morton Botel (3), and the EDL vocabulary developed by Stanford
Taylor and others (4). The EDL vocabulary was broken into two
sublists which were compared independently, one for levels 1-8
and one for levels 9-13. The results of the comparison are
shown in Table I.

Of the 2,946 words in the Dale list, 2,744 or 93 percent
also appear in the Harris-Jacobson List. Of the 3,266 words in
the Botel List (including inflected forms), 3,095 or 94 percent
are also in the Harris-Jacobson List. Thus the overlapping
among these three lists is quite high. The degree of overlapping
with the two Taylor lists is lower. Of the 6,714 Taylor words
for grades one through eight, 5,473 or 81 percent are also in
the Harris-Jacobson list. This is not surprising, since the
Harris-Jacobson list stops at sixth grade and the Taylor list
includes seven and eight. The Taylor high school list shows still
less overlapping.
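
The percentages quoted above can be rechecked from the counts in the text. The short Python sketch below is only an arithmetic check; the variable and function names are hypothetical. It uses integer division because the reported figures agree with truncated rather than rounded percentages.

```python
# (words in comparison list, words also in Harris-Jacobson), from the text.
COUNTS = {
    "Dale": (2946, 2744),
    "Botel": (3266, 3095),
    "Taylor 1-8": (6714, 5473),
}


def overlap_percent(total, shared):
    """Whole-number percentage of a comparison list found in
    Harris-Jacobson, truncated toward zero."""
    return shared * 100 // total


def unmatched(total, shared):
    """Words in the comparison list that are not in Harris-Jacobson."""
    return total - shared


for name, (total, shared) in COUNTS.items():
    print(name, overlap_percent(total, shared), unmatched(total, shared))
```

Truncation reproduces the 93, 94, and 81 percent given above, whereas ordinary rounding would give 95 for the Botel list and 82 for the Taylor list; the differences (202, 171, and 1,241 words) agree with the last row of Table I.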

While these tallies are interesting, the output of this
comparison program provides a means for a detailed content
analysis to discover the reasons for differences or overlap
between texts. The matched and mismatched words can be
scrutinized to ascertain what factors or features of the various
lists might explain the results of a comparison.




TABLE I

COMPARISON OF THE HARRIS-JACOBSON BASIC ELEMENTARY
READING VOCABULARY WITH FOUR OTHER WORD LISTS

                                  LIST BEING COMPARED

                           Dale     Botel    Taylor   Taylor
                           List     List     (1-8)    (9-13)

Total Number of Words
in Harris-Jacobson List    16,849   16,849   16,849   16,849

Total Number of Words
in Comparison List          2,946    3,266+   6,714    2,426

Number of Words in
Harris-Jacobson That Are
Not in Comparison List     14,105   13,754   11,376   16,670

Number of Words
in Both Lists               2,744    3,095    5,473      179

Number of Words in
Comparison Not in
Harris-Jacobson               202      171    1,241    2,247

*Harris and Jacobson, Basic Elementary Reading Vocabularies.
Of the 16,849 entries, 7,612 are root words in the published lists
and 9,237 are inflected forms not printed as separate entries.

+Basically 1,135 words. When separate entries are made for each
variant form it consists of 3,266 words (example: beat, beats,